traffic = read.csv("DenTraffic2019ToModel.csv")
traffic$SEGMDIR = as.factor(traffic$SEGMDIR)
traffic$FUNCCLASSI = as.factor(traffic$FUNCCLASSI)
traffic = subset(traffic, select = -c(X))
interaction of road size and number of lanes
mod1 = lm(AADT ~ SEG_LENGTH + I(SEGMDIR) + I(FUNCCLASSI) + SURFWD + PSI + THRULNQTY + THRULNWD + RUNLENGTH1, data = traffic)
summary(mod1)
##
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(SEGMDIR) + I(FUNCCLASSI) +
## SURFWD + PSI + THRULNQTY + THRULNWD + RUNLENGTH1, data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23351 -2975 -712 2125 34405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16587.797 999.726 16.592 < 2e-16 ***
## SEG_LENGTH 724.590 1542.034 0.470 0.63845
## I(SEGMDIR)N -48.911 216.170 -0.226 0.82101
## I(SEGMDIR)NE -2667.398 305.477 -8.732 < 2e-16 ***
## I(SEGMDIR)NW -2367.164 371.460 -6.373 2.02e-10 ***
## I(SEGMDIR)S 1331.897 233.969 5.693 1.32e-08 ***
## I(SEGMDIR)SE 105.598 439.563 0.240 0.81016
## I(SEGMDIR)SW 1623.074 1204.315 1.348 0.17781
## I(SEGMDIR)W 598.901 213.841 2.801 0.00512 **
## I(FUNCCLASSI)4 Minor Arterial -12515.773 228.678 -54.731 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16972.644 253.597 -66.928 < 2e-16 ***
## SURFWD 37.523 4.252 8.824 < 2e-16 ***
## PSI -63.672 73.479 -0.867 0.38624
## THRULNQTY 2431.109 91.211 26.654 < 2e-16 ***
## THRULNWD -64.456 83.470 -0.772 0.44003
## RUNLENGTH1 -118.780 78.603 -1.511 0.13082
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5190 on 5102 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7253
## F-statistic: 901.8 on 15 and 5102 DF, p-value: < 2.2e-16
The only features for which we don’t have overwhelming evidence are effective in predicting \(\verb|AADT|\) are \(\verb|SEG_LENGTH|\) , some directions of \(\verb|SEGDIR|\) , \(\verb|PSI|\) , \(\verb|THRULNWD|\) and \(\verb|RUNLENGTH1|\) . After a quick collinearity check, we will systematically remove the variables one at a time.
This checks for collinearity in our features. This statistic is on a scale that starts at 1, and lower values are indicative of no collinearity.
vif(mod1)
## GVIF Df GVIF^(1/(2*Df))
## SEG_LENGTH 1.076254 1 1.037427
## I(SEGMDIR) 1.365241 7 1.022487
## I(FUNCCLASSI) 1.742328 2 1.148901
## SURFWD 1.375148 1 1.172667
## PSI 1.054805 1 1.027037
## THRULNQTY 2.051504 1 1.432307
## THRULNWD 1.087260 1 1.042717
## RUNLENGTH1 1.107105 1 1.052191
None of these features are of concern. This supports the conclusion we started to come to with the correlation heat-map to keep each feature in our first model.
Based off our the significant levels from our model, we will run an ANOVA test to see if the features just mentioned help the model enough to keep them in. We will omit them in if the p-score associated with their F-statistic (for variables with factors) or p-value (for numeric variables) is greater than 0.10.
mod1sansDIR = update(mod1,.~.-I(SEGMDIR))
anova(mod1sansDIR,mod1)
## Analysis of Variance Table
##
## Model 1: AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + PSI + THRULNQTY +
## THRULNWD + RUNLENGTH1
## Model 2: AADT ~ SEG_LENGTH + I(SEGMDIR) + I(FUNCCLASSI) + SURFWD + PSI +
## THRULNQTY + THRULNWD + RUNLENGTH1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5109 1.4289e+11
## 2 5102 1.3743e+11 7 5456470847 28.939 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An F-statistic of 20.388 and p-value of 2.2e-16 indicates we should keep \(\verb|SEGMDIR|\) in the model. This warrants more discussion though, as many of the individual directions didn’t have a significant impact on the model. There isn’t any science to support that the direction of travel would affect the average daily traffic. Furthermore, in conjunction with our visualization of the miles of road in each direction in Denver, there is evidence to suggest this data is not useful for predicting \(\verb|AADT|\) . Accordingly, we will remove it. H
mod1 = mod1sansDIR
summary(mod1)
##
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + PSI +
## THRULNQTY + THRULNWD + RUNLENGTH1, data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22953 -2985 -717 2464 35516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18365.327 989.925 18.552 < 2e-16 ***
## SEG_LENGTH 1698.624 1548.703 1.097 0.27278
## I(FUNCCLASSI)4 Minor Arterial -12479.458 231.137 -53.992 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16784.120 255.760 -65.624 < 2e-16 ***
## SURFWD 40.171 4.289 9.367 < 2e-16 ***
## PSI -13.546 73.638 -0.184 0.85406
## THRULNQTY 2332.657 90.694 25.720 < 2e-16 ***
## THRULNWD -263.928 82.337 -3.205 0.00136 **
## RUNLENGTH1 -14.140 78.281 -0.181 0.85667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5288 on 5109 degrees of freedom
## Multiple R-squared: 0.7153, Adjusted R-squared: 0.7148
## F-statistic: 1604 on 8 and 5109 DF, p-value: < 2.2e-16
So without \(\verb|SEGMDIR|\) the largest p-value is \(\verb|RUNLENGTH1|\). So let’s see how the model is without it.
mod1 = update(mod1,.~.-RUNLENGTH1)
summary(mod1)
##
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + PSI +
## THRULNQTY + THRULNWD, data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22954 -3002 -717 2469 35528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18341.672 981.132 18.694 < 2e-16 ***
## SEG_LENGTH 1700.344 1548.527 1.098 0.27224
## I(FUNCCLASSI)4 Minor Arterial -12476.347 230.473 -54.134 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16775.285 251.016 -66.830 < 2e-16 ***
## SURFWD 40.156 4.287 9.366 < 2e-16 ***
## PSI -14.638 73.382 -0.199 0.84190
## THRULNQTY 2334.305 90.225 25.872 < 2e-16 ***
## THRULNWD -263.679 82.317 -3.203 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5288 on 5110 degrees of freedom
## Multiple R-squared: 0.7153, Adjusted R-squared: 0.7149
## F-statistic: 1834 on 7 and 5110 DF, p-value: < 2.2e-16
So without \(\verb|RUNLENGTH1|\) the largest p-value is \(\verb|PSI|\). So let’s see how the model is without it.
mod1 = update(mod1, .~.-PSI)
summary(mod1)
##
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + THRULNQTY +
## THRULNWD, data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22941 -3002 -718 2473 35516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18283.673 936.969 19.514 < 2e-16 ***
## SEG_LENGTH 1714.346 1546.790 1.108 0.26777
## I(FUNCCLASSI)4 Minor Arterial -12476.450 230.450 -54.139 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16776.388 250.931 -66.856 < 2e-16 ***
## SURFWD 40.085 4.272 9.382 < 2e-16 ***
## THRULNQTY 2335.030 90.144 25.903 < 2e-16 ***
## THRULNWD -262.899 82.217 -3.198 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5287 on 5111 degrees of freedom
## Multiple R-squared: 0.7153, Adjusted R-squared: 0.7149
## F-statistic: 2140 on 6 and 5111 DF, p-value: < 2.2e-16
So without \(\verb|PSI|\) the largest p-value is \(\verb|SEG_LENGTH|\). So let’s see how the model is without it.
mod1 = update(mod1, .~.-SEG_LENGTH)
summary(mod1)
##
## Call:
## lm(formula = AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD,
## data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22847 -3016 -735 2465 35441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18372.956 933.521 19.681 < 2e-16 ***
## I(FUNCCLASSI)4 Minor Arterial -12435.033 227.405 -54.682 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16734.328 248.051 -67.463 < 2e-16 ***
## SURFWD 39.856 4.267 9.339 < 2e-16 ***
## THRULNQTY 2336.661 90.134 25.924 < 2e-16 ***
## THRULNWD -263.073 82.219 -3.200 0.00138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5288 on 5112 degrees of freedom
## Multiple R-squared: 0.7152, Adjusted R-squared: 0.7149
## F-statistic: 2567 on 5 and 5112 DF, p-value: < 2.2e-16
This looks good! Each of the p-values are well below our alpha of 0.10. So our first linear model is
\[ \widehat {\verb|AADT|} = 18844 - 12353 \cdot \verb|I|\verb|(Class=4|) -16705 \cdot \verb|I|\verb|(Class=5|) + 44\cdot \verb|SURFWD| + 2167\cdot \verb|THRULNQTY| -294\cdot \verb|THRULNWD| \]
We want to show that a linear model is a good type of model for this interaction. On the residual plot below, this is indicated by the points being symmetrically distributed above and below the value of 0. Furthermore, the residual points should be uniform and not fan out to ensure our model has equal variance of residuals.
plot(mod1$residuals)
Although this isn’t perfect, there isn’t anything so egregious to justify not using a linear model.
We also want our model to be normally distributed, which on the Q-Q plot below would look like a straight 45 degree line.
plot(mod1,which=2)
Like before, this isn’t perfect, but given that there are over 6000 observations, it is not unexpected to see at least some theoretical quantiles fall below -4 or above 4.
outliers = outlierTest(mod1)
subset(traffic,rownames(traffic)%in%names(outliers$rstudent))
## SEG_LENGTH ROUTENAME SEGMDIR FUNCCLASSI SURFWD
## 141 0.06 W 6TH AVENUE FWY W 3 Principal Arterial - Other 76
## 484 0.01 N PEORIA ST N 4 Minor Arterial 48
## 510 0.03 N PEORIA ST N 4 Minor Arterial 48
## 761 0.01 W 6TH AVENUE FWY W 3 Principal Arterial - Other 76
## 815 0.01 W 6TH AVENUE FWY W 3 Principal Arterial - Other 20
## 943 0.04 W 6TH AVENUE FWY W 3 Principal Arterial - Other 20
## 2432 0.10 E 1ST AVE E 3 Principal Arterial - Other 36
## 3171 0.08 W 6TH AVENUE FWY W 3 Principal Arterial - Other 84
## 3810 0.02 W 6TH AVENUE FWY W 3 Principal Arterial - Other 20
## 5111 0.04 W 6TH AVENUE FWY W 3 Principal Arterial - Other 20
## PSI THRULNQTY THRULNWD AADT AADTDERIV RUNLENGTH1
## 141 1 6 12 66000 3 0.26
## 484 4 4 12 40000 3 0.41
## 510 4 4 12 40000 3 0.40
## 761 1 6 12 66000 3 0.20
## 815 4 6 10 66000 3 0.31
## 943 4 6 10 66000 3 0.19
## 2432 4 3 12 52000 0 0.10
## 3171 4 6 10 66000 3 0.08
## 3810 4 6 10 66000 3 0.10
## 5111 4 6 10 66000 3 0.30
So our only outlines were 1st, 6th, and Peoria Street. These are all larger streets. 1st is particularly impressive, with only 3 lanes and an \(\verb|AADT|\) of 52000. In an effort to better model smaller streets, we acknowledge that these outliers are skewing the linear model and affecting the model. We will also try a model without these observations for thoroughness.
traffic = subset(traffic,!rownames(traffic)%in%names(outliers$rstudent))
mod2 = lm(AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD , data = traffic)
summary(mod2)
##
## Call:
## lm(formula = AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD,
## data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22366.8 -3004.1 -697.6 2407.4 25043.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18844.542 900.267 20.932 < 2e-16 ***
## I(FUNCCLASSI)4 Minor Arterial -12353.450 219.175 -56.363 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16705.626 238.951 -69.912 < 2e-16 ***
## SURFWD 44.674 4.124 10.832 < 2e-16 ***
## THRULNQTY 2167.743 87.276 24.838 < 2e-16 ***
## THRULNWD -294.249 79.288 -3.711 0.000209 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5091 on 5102 degrees of freedom
## Multiple R-squared: 0.7235, Adjusted R-squared: 0.7232
## F-statistic: 2670 on 5 and 5102 DF, p-value: < 2.2e-16
As expected, removing these few points didn’t actually change that much with our correlation coefficients nor the associated statistics.
\[ \widehat {\verb|AADT|} = 18844 - 12353 \cdot \verb|I|\verb|(Class=4|) -16705 \cdot \verb|I|\verb|(Class=5|) + 44\cdot \verb|SURFWD| + 2167\cdot \verb|THRULNQTY| -294\cdot \verb|THRULNWD| \]
It is also important to consider how features interact with each other. For example, dividing the width of lanes by the inverse of lane quantity can be considered some sort of density. Or, given that there is a lot of time spent in acceleration, and deceleration, it would be understandable if the length of the run had a cubic interaction to account for saved time and ease of flow on longer stretches.
mod3 = lm(AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD + THRULNWD/THRULNQTY + I(RUNLENGTH1^3), data = traffic)
summary(mod3)
##
## Call:
## lm(formula = AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD +
## THRULNWD/THRULNQTY + I(RUNLENGTH1^3), data = traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22639.1 -2945.0 -713.3 2359.9 24877.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23756.682 2139.413 11.104 < 2e-16 ***
## I(FUNCCLASSI)4 Minor Arterial -12434.206 220.533 -56.383 < 2e-16 ***
## I(FUNCCLASSI)5 Major Collector -16875.499 245.288 -68.799 < 2e-16 ***
## SURFWD 45.546 4.147 10.982 < 2e-16 ***
## THRULNQTY 321.932 751.792 0.428 0.668509
## THRULNWD -733.181 196.884 -3.724 0.000198 ***
## I(RUNLENGTH1^3) -9.346 4.366 -2.141 0.032339 *
## THRULNQTY:THRULNWD 169.734 69.463 2.444 0.014579 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5086 on 5100 degrees of freedom
## Multiple R-squared: 0.7241, Adjusted R-squared: 0.7237
## F-statistic: 1912 on 7 and 5100 DF, p-value: < 2.2e-16
\[ \widehat {\verb|AADT|} = 23756 - 12434 \cdot \verb|I|\verb|(Class=4|) -16875 \cdot \verb|I|\verb|(Class=5|) + 45\cdot \verb|SURFWD| + 321\cdot \verb|THRULNQTY| + 169 \cdot \frac{\verb|THRULNWD|}{\verb|THRULNQTY|} -733\cdot \verb|THRULNWD| -9\cdot \verb|RUNLENGTH|^3 \]
Including these interactions made the model more convoluted and difficult to predict with. Furthermore, including them greatly changed the intercept and the coefficient for \(\verb|THRULNQTY|\). So it is questionable that this would be a better model in practice. Accordingly, we propose the use of the second model generally, and it will be the one used in latter parts of the project.
trafficTest = read.csv("DenTraffic2019Test.csv")
y_pred = predict(mod2, type = 'response', newdata = trafficTest)
MAPE(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.753313
MSE(y_pred=y_pred,y_true = trafficTest$AADT)
## [1] 30168916
NormalizedGini(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.823578
goodPreds = y_pred > .9*trafficTest$AADT & y_pred < 1.1*trafficTest$AADT
sum(goodPreds)/length(y_pred)
## [1] 0.1936219
Hmm these results don’t seem too promising. It looks like only about 18% of predictions were within 10% of the real world value. Furthermore, an mean absolute percentage error of 75% indicates that the model does quite poorly, but the normalized GINI coefficient of 0.82 indicates that our model does well given its circumstances. These values shouldn’t change much for our different models though, so we’ll only show the statistics for the 3th model.
y_pred = predict(mod3, type = 'response', newdata = trafficTest)
MAPE(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.7474903
MSE(y_pred=y_pred,y_true = trafficTest$AADT)
## [1] 30193413
NormalizedGini(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.8231322
goodPreds = y_pred > .9*trafficTest$AADT & y_pred < 1.1*trafficTest$AADT
sum(goodPreds)/length(y_pred)
## [1] 0.2095672
Nothing changes that much. Our models are very similar and had similar \(R^2\) scores so it’s very understandable that they performed similarly.
saveRDS(mod1,"model1Full")
saveRDS(mod2,"model2Thin")
saveRDS(mod3,"model3Inter")
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
fig <- plot_ly(traffic, x = ~AADT, y = ~SURFWD, z = ~THRULNQTY, type = 'scatter3d',color=traffic$FUNCCLASSI, opacity = 1, mode= "marker")
fig